(57) Abstract



A tonal sound recognizer determines tones in a tonal language without the use of voicing recognizers or peak picking rules. The 
tonal sound recognizer computes feature vectors for a number of segments of a sampled tonal sound signal in a feature vector computing 
device (120), compares the feature vectors of a first of the segments with the feature vectors of another segment in a cross-correlator (130)
to determine a trend of a movement of a tone of the sampled tonal sound signal and uses the trend as an input to a word recognizer (140)
to determine a word or part of a word of the sampled tonal sound signal.

METHOD AND RECOGNIZER FOR RECOGNIZING TONAL
ACOUSTIC SOUND SIGNALS 

Field of the Invention 


The present invention relates, in general, to sound 
recognition, and in particular, to sound recognition of tonal 
acoustic sound signals. 

Background of the Invention

In the complex technology of speech recognition, one of
the most difficult challenges is recognizing sounds spoken in a
language that has both tonal fluctuations and voiced/unvoiced
sounds. In Latin-based or Germanic languages, or in Japanese,
tones are not a problem. These languages may be spoken in a
monotone voice, and although the effect is uninteresting, the
meaning is the same as if inflection were added. This is not the
case in tonal languages such as Chinese.

Mandarin Chinese is generally understood to have five
tones. The first tone is monotone, the second rises, the third
falls for a short time and then rises, and the fourth falls. The
lengths of the tones vary: the first is usually the longest, the
second and third are usually similar in length, and the
fourth is generally the shortest. The fifth tone, although not
actually a tone, is neutral and is used on some syllables that are
suffixes to words.

As with nearly all languages, Mandarin uses voiced and
unvoiced sounds. A voiced sound is one generated by the vocal
cords opening and closing at a constant rate, giving off pulses of
air. The distance between the peaks of the pulses is known as
the pitch period. An example of a voiced sound is the "i"
sound in the word "pill". An unvoiced sound is one
generated by a single rush of air, which results in turbulent air
flow. Unvoiced sounds have no defined pitch. An example of
an unvoiced sound is the "p" sound in the word "pill".
The word "pill" therefore combines voiced and unvoiced sounds:
the "p" requires the single rush of air, while
the "ill" requires a series of air pulses.

Although essentially all languages use voiced and
unvoiced sounds, recognition of tonal languages is particularly
difficult because the tone occurs only on the voiced segments of
the words.

Conventional speech recognizers for tonal languages such as
Mandarin usually attempt to estimate
pitch frequency. First, the speech recognizer must determine whether
the sound was voiced or unvoiced. This is done using a
voicing detector. Unfortunately, detecting voicing is very
difficult and the process is prone to error. Even if the voicing is
decided correctly, the recognizer must next estimate the pitch
frequency, using a pitch estimator, to determine the tone.
Conventional pitch estimators generally use a process based
upon auto-correlation, although other methods, such as
auto-regression methods, are also used.

Auto-correlation is essentially the matching of peaks in a
sampled signal such as a sound signal. A sampled digital
speech signal is divided into segments. The auto-correlation
method then compares the sampled speech signal of each
segment, over all values of $t$, with the sampled speech signal of
the same segment at time $t - \tau$ for negative, positive and zero
values of $\tau$, where $\tau$ is a time lag. For $\tau = 0$, the signal is
compared with itself, providing a measure of signal power. The
auto-correlation function has a peak value at $\tau$ when the signal
at time $t$ aligns well with the signal at time $t - \tau$. The equation
for auto-correlation is:

$$R_{xx}(\tau) = \sum_{t} x_i(t)\, x_i(t - \tau)$$

where $R_{xx}(\tau)$ is the auto-correlation at a lag $\tau$, $x_i(t)$ is the speech
signal at a segment "i" and time "t", $x_i(t - \tau)$ is the speech signal
at segment "i" and time $(t - \tau)$ (a version of $x_i(t)$ delayed by $\tau$
samples), and $\sum$ is the summation over a range of times at a
given lag $\tau$.
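
As an illustration only, the auto-correlation sum described above can be sketched in Python. The function name `autocorrelation` and the toy segment are assumptions made for this example; the sum is restricted to the time indices for which both samples fall inside the segment, which is one possible reading of "a range of times".

```python
def autocorrelation(segment, lag):
    """Sum of x(t) * x(t - lag) over the indices t where both samples exist."""
    n = len(segment)
    total = 0.0
    for t in range(n):
        if 0 <= t - lag < n:
            total += segment[t] * segment[t - lag]
    return total


# Toy periodic segment with a period of 4 samples (values are made up).
segment = [1.0, 0.0, -1.0, 0.0] * 4

power = autocorrelation(segment, 0)    # lag 0: the signal compared with itself
aligned = autocorrelation(segment, 4)  # lag equal to the period: strong peak
opposed = autocorrelation(segment, 2)  # half a period: the signal is anti-aligned
```

In this toy case the value at a lag of one full period is large and positive, while the value at half a period is negative, which is the peak structure that a conventional pitch estimator would go on to search.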

After the auto-correlation has established peak values
indicating alignment between the sampled signal at $t$ and the
sampled signal at $(t - \tau)$ for the same segment, one of the peaks
is selected as a potential pitch frequency, and predetermined or
established constraints are used to decide whether the chosen peak
corresponds to a pitch frequency. Often a peak is selected as a
potential frequency by sorting the peak values from the
auto-correlation process from lowest to highest and picking the
highest value. The criterion for deciding whether this potential
peak value is the correct pitch frequency is whether the peak value is
reasonable given the established constraints. These constraints
vary, but generally include whether the frequency
of the peak is reasonably within the range known to be produced
by human speech and whether the frequency of the peak is
reasonable given previous pitch frequencies.
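
The kind of rule set described above might look like the following Python sketch. It is not taken from any particular recognizer: the function name `pick_pitch_frequency`, the 60 to 400 Hz speech range, and the `max_jump_ratio` continuity threshold are all hypothetical values chosen only to make the two constraints concrete.

```python
def pick_pitch_frequency(peaks_by_lag, sample_rate_hz, min_hz=60.0, max_hz=400.0,
                         previous_hz=None, max_jump_ratio=1.5):
    """Return a pitch frequency in Hz, or None if the heuristic rules reject it.

    peaks_by_lag maps a positive lag (in samples) to its auto-correlation value.
    All thresholds are illustrative assumptions.
    """
    # Sort the peak values from lowest to highest and take the highest one.
    lag, _ = sorted(peaks_by_lag.items(), key=lambda item: item[1])[-1]
    frequency = sample_rate_hz / lag

    # Constraint 1: the frequency must lie in a range plausible for human speech.
    if not (min_hz <= frequency <= max_hz):
        return None

    # Constraint 2: the frequency must be reasonable given the previous pitch.
    if previous_hz is not None:
        ratio = max(frequency, previous_hz) / min(frequency, previous_hz)
        if ratio > max_jump_ratio:
            return None

    return frequency


peaks = {20: 0.3, 40: 0.9, 80: 0.7}
estimate = pick_pitch_frequency(peaks, sample_rate_hz=8000)
```

Note that the peak at lag 80 is a multiple of the peak at lag 40; had its value been slightly higher, the same rules would have accepted half the frequency.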

This process of defining rules and constraints is known as
a heuristic method of analysis. Unfortunately, a heuristic
method, when applied to pitch frequency estimation, is prone to
error. For instance, the process may not distinguish between
an actual peak and a harmonic. Additionally, there is never a
clear definition of what a good rule is, and those rules that are
established generally do not apply in all circumstances.

An example of a voiced tonal sound signal is the
syllable Ta3, representing the pronunciation of the letters "ta"
with tone 3 mentioned above. Ta3 has both voiced and unvoiced
sounds as well as a moving tone. Part of Ta3 has pitch and part
does not. To analyze Ta3 adequately, many rules must be
generated, each of which would have any number of exceptions.

The complex sets of changing rules result in high
probabilities of error. Coupled with the requirement to
determine properly whether a syllable is voiced or unvoiced,
conventional tonal sound recognizers cannot supply a high
degree of accuracy in a reasonable manner. Additionally, tone
is only one input to a sound recognizer that recognizes words or
phrases from the input voiced tonal sound signal. Each of the
inputs has a probability of error, greater or smaller. As each of
these inputs brings error into the process, the probability of
accuracy drops. Therefore, it is necessary to increase the
probability of accuracy of the tonal recognition in order to help
increase the probability of accuracy of the sound recognizer.

Brief Description of the Drawings

FIG. 1 is a block diagram of a tonal sound recognizer 
according to a preferred embodiment of the present invention. 

FIG. 2 is a flow diagram of the tonal sound recognizer of
FIG. 1 in accordance with the preferred embodiment of the
present invention.

FIG. 3 is a flow diagram of the calculation of feature 
vectors according to a preferred embodiment of the present 
invention. 






Detailed Description of the Invention 



Key to building a tonal sound recognizer that increases
the probability of accuracy is eliminating voicing recognizers
and eliminating the need to determine pitch frequency. To
do this, there must be a way to compute tone trends for all
sounds, voiced and unvoiced. Additionally, peak picking must
be avoided, thus eliminating the need for the complex and
cumbersome peak picking rules.

The present invention eliminates the voicing recognizer
and the complex sets of rules of conventional tonal recognizers
by tracking the trend, or movement, of the tone to determine if
it is rising, falling, constant, or some combination of these. The
trend is determined from segment to segment of the sampled
sound signal.

FIG. 1 shows a tonal sound recognizer 100 according to
the present invention. Referring also to the flow chart of FIG. 2,
which shows the basic operation of the present invention, tonal
sound recognizer 100 receives a sampled sound signal in analog
form and converts the sampled sound signal into a digital signal
in ADC 110. The digital sampled sound signal is then sent to
feature vector computing device 120, where the sampled sound
signal is segmented into a number of segments, preferably
segments of equal width, for analysis. Feature vectors for each
segment of the sampled sound signal are computed (210 of FIG.
2) in feature vector computing device 120. The feature vectors
are vectors containing information that describes the tone
trend of the sampled sound signal.

In the preferred embodiment of the present invention,
each segment is an analysis frame, and a sampled signal may be
divided into any number of analysis frames. These analysis
frames may overlap, be joined end-to-end, or be spaced apart,
depending upon the needs of the design. In the preferred
embodiment, the analysis frames overlap.
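
The three frame layouts mentioned above can be illustrated with a short Python sketch. The function name `split_into_frames` and the `hop` parameter (the distance in samples between the starts of consecutive frames) are assumptions introduced for this example; the frame length and hop values are arbitrary.

```python
def split_into_frames(samples, frame_length, hop):
    """Cut samples into equal-width analysis frames that start every `hop` samples.

    hop < frame_length  -> overlapping frames
    hop == frame_length -> frames joined end-to-end
    hop > frame_length  -> frames spaced apart
    Trailing samples that do not fill a whole frame are dropped.
    """
    frames = []
    start = 0
    while start + frame_length <= len(samples):
        frames.append(samples[start:start + frame_length])
        start += hop
    return frames


samples = list(range(10))
overlapping = split_into_frames(samples, frame_length=4, hop=2)
end_to_end = split_into_frames(samples, frame_length=4, hop=4)
spaced = split_into_frames(samples, frame_length=4, hop=6)
```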

Each of the analysis frames is multiplied by samples of
a window function or weighting function. Any window function
may be applied. In the preferred embodiment, a Hamming
window is used, with values described by

$$w(\mathrm{index}) = 0.54 - 0.46 \cos\left(\frac{2\pi\,\mathrm{index}}{l_{window}}\right)$$

where $w$ is the window function at sample index "index" and
"$l_{window}$" is the length of the window.

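A minimal Python sketch of the windowing step follows. It assumes the Hamming window has the form w(index) = 0.54 - 0.46 cos(2 pi index / l_window), with the window length itself in the denominator; some libraries divide by the length minus one instead. The function names are hypothetical.

```python
import math


def hamming_window(length):
    """Hamming window values for index = 0 .. length - 1 (assumed form, see above)."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * index / length)
            for index in range(length)]


def apply_window(frame, window):
    """Multiply each sample of an analysis frame by the matching window sample."""
    return [sample * weight for sample, weight in zip(frame, window)]


window = hamming_window(8)
windowed = apply_window([2.0] * 8, window)
```

The window tapers the edges of each frame toward a small value and leaves the middle of the frame nearly unchanged.
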
In the preferred embodiment, the feature vectors for
each analysis frame are Cepstral vectors or Cepstral
coefficients. To obtain the Cepstral coefficients of an analysis
frame, with reference to FIG. 3, the Fourier transform of the
sampled sound signal in an analysis frame ($s(t)$) is calculated
(310) to yield the spectrum ($S(\omega)$) of the sampled sound signal
in the analysis frame. The spectrum is then squared (320),
yielding the power spectrum ($|S(\omega)|^2$), and the log of the power
spectrum ($C(\omega)$) is calculated (330). At this point, $C(0)$
(frequency equals zero) is set to zero, or $C(0) = 0$ (340). The
zero element of the power spectrum is a measure of the DC
energy in the signal and carries no information related to tone.
Therefore, $C(0)$ is eliminated from the analysis.

After eliminating $C(0)$ from the analysis, the inverse
Fourier transform is calculated for $C(\omega)$ (350) to obtain the
Cepstral coefficients or Cepstral vector $C(n)$.
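
Steps 310 to 350 can be sketched in Python as follows. This is an illustration under stated assumptions, not the implementation of feature vector computing device 120: a direct discrete Fourier transform stands in for the Fourier transform, the `floor` argument (which avoids taking the log of zero) is an addition not found in the text, and the function names are hypothetical.

```python
import cmath
import math


def dft(values, inverse=False):
    """Direct discrete Fourier transform; slow, but adequate for a short frame."""
    n = len(values)
    sign = 1.0 if inverse else -1.0
    out = []
    for k in range(n):
        acc = 0j
        for t in range(n):
            acc += values[t] * cmath.exp(sign * 2j * math.pi * k * t / n)
        out.append(acc / n if inverse else acc)
    return out


def cepstral_vector(frame, floor=1e-12):
    spectrum = dft(frame)                                  # step 310: spectrum
    power = [abs(s) ** 2 for s in spectrum]                # step 320: power spectrum
    log_power = [math.log(max(p, floor)) for p in power]   # step 330: log
    log_power[0] = 0.0                                     # step 340: zero-frequency term set to zero
    cepstrum = dft(log_power, inverse=True)                # step 350: inverse transform
    return [c.real for c in cepstrum]                      # imaginary parts are rounding noise


frame = [0.5, 1.0, -0.3, 0.2, 0.0, -0.7, 0.4, 0.1]
cepstrum = cepstral_vector(frame)
```

Because the zero-frequency element of the log power spectrum is set to zero before the inverse transform, the resulting Cepstral coefficients sum to zero.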

Referring again to FIG. 1, the feature vectors calculated in
feature vector computing device 120 are sent to
cross-correlator 130. Cross-correlator 130 compares the feature
vectors in any one of the segments (a first) with the feature
vectors of the next segment (a second) in time to determine if
the tone of the sampled tonal sound signal is moving up, moving down,
staying constant, or some combination of the three (220 of FIG.
2) between the segments. For instance, the feature vectors of a
first segment would be compared with the feature vectors of a
second segment to determine the direction of movement
between the segments over time. The process of comparing
feature vectors in different segments is known as the
cross-correlation method and is defined by the equation:

$$R_{xy,i}(\tau) = \sum_{t} FV_i(t)\, FV_{i+1}(t - \tau)$$

where $R_{xy,i}(\tau)$ is the cross-correlation value $R_{xy}$ at one of the
number of segments "i" for a sampled tonal sound signal at a
lag "$\tau$", $FV_i(t)$ is one of the feature vectors at the "i-th" one of the
number of segments and at time "t", $FV_{i+1}(t - \tau)$ is one of the
feature vectors at the "i+1" one of the number of segments and
at time $t - \tau$, and $\sum$ is a summation of $FV_i(t)\, FV_{i+1}(t - \tau)$ for a
range of values of "t". The range is determined to be that range
generally recognized to contain tonal trend information.

The lag "$\tau$" is a shift in time in either direction from time
"t" of the analyzed segment.

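The cross-correlation of two consecutive feature vectors can be sketched in Python as shown below. The function names, the `max_lag` search range, and the toy vectors are assumptions made for illustration; the sum runs over the indices where both elements exist.

```python
def cross_correlation(fv_current, fv_next, lag):
    """Sum of FV_i(t) * FV_(i+1)(t - lag) over the indices t where both exist."""
    total = 0.0
    for t in range(len(fv_current)):
        if 0 <= t - lag < len(fv_next):
            total += fv_current[t] * fv_next[t - lag]
    return total


def best_lag(fv_current, fv_next, max_lag):
    """Lag in [-max_lag, max_lag] at which the two feature vectors align best."""
    lags = range(-max_lag, max_lag + 1)
    return max(lags, key=lambda lag: cross_correlation(fv_current, fv_next, lag))


# Toy feature vectors: the single peak sits one position further right
# in the next segment than in the current one.
fv_i = [0, 0, 1, 0, 0, 0]
fv_i_plus_1 = [0, 0, 0, 1, 0, 0]
shift = best_lag(fv_i, fv_i_plus_1, max_lag=2)
```

With the sign convention of the equation as written here, a peak that moves one position to the right in the next segment produces a best lag of -1, and a peak that moves one position to the left produces +1.
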
The cross-correlation method defines peaks which
correlate from segment to segment. These peaks are
conceptually plotted segment by segment; in other words,
the correlated peaks of the segments are lined up over time.
The position of the peaks in the segments defines the
direction in which the tone of the sampled tonal sound signal is moving.
For instance, if correlating peaks from a first segment to a
second segment are moving to the right, the tone of the sampled
tonal sound signal is moving up. If the peaks from segment to
segment are moving left, the tone is moving down. If the peaks
are generally in the same location from segment to segment,
the tone is staying the same. Cross-correlator 130 tracks a
trend of movement of the tone (230 of FIG. 2). Using the
Mandarin tones as an example, a sampled tonal sound signal
having peaks moving to the right would define the second tone
discussed previously. Movement to the left for a short time
and then back to the right would define the third tone
discussed earlier. No movement, or insignificant movement,
would define the first tone, whereas movement only to the left
over time would define the fourth tone discussed previously.
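
The mapping from peak movement to the four Mandarin tones described above can be sketched in Python. The input format (one correlated peak position per segment), the function name `classify_tone`, and the `tolerance` used to ignore insignificant movement are all assumptions made for this example; the text does not specify how movement is quantified.

```python
def classify_tone(peak_positions, tolerance=0):
    """Map per-segment peak positions to a Mandarin tone number (1 to 4).

    Returns None when the movement matches none of the four described patterns.
    """
    steps = [later - earlier
             for earlier, later in zip(peak_positions, peak_positions[1:])]

    # Collapse the steps into a list of distinct consecutive directions,
    # ignoring any step no larger than the tolerance.
    directions = []
    for step in steps:
        if abs(step) <= tolerance:
            continue
        direction = "right" if step > 0 else "left"
        if not directions or directions[-1] != direction:
            directions.append(direction)

    if not directions:
        return 1                      # no significant movement: first tone
    if directions == ["right"]:
        return 2                      # peaks move right: second tone
    if directions == ["left", "right"]:
        return 3                      # left for a time, then right: third tone
    if directions == ["left"]:
        return 4                      # peaks move only left: fourth tone
    return None


tone = classify_tone([5, 4, 3, 4, 6])
```

A sequence that changes direction repeatedly falls through to `None`, since it matches none of the four patterns.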

The trend of the tone is an element used by a word
recognizer such as word recognizer 140 to recognize a word or
syllable. Word recognizer 140, according to the preferred
embodiment of the present invention, recognizes sounds or
words in a tonal language using the tone trend as one input,
together with other acoustic features, to the recognition process. Word
recognizer 140 then determines a word or part of a word using
the input information.

There is no recognizable trend in tone in unvoiced
segments of a tonal language over time. Instead, any tone
pattern will move about randomly. Word recognizer 140
can identify random movement and will categorize the
segments having random movement as unvoiced segments.

By computing feature vectors for a sampled tonal sound
signal, such as Cepstral vectors, and comparing these feature
vectors across segments of the sampled tonal sound signal over
time, the movement of the tone may be determined, which
defines a tone trend. The tone trend is then used in place of
the prior art combination of a voicing recognizer and a complex
set of peak picking rules as an input to the recognition process.

The process of the present invention as described above
may be stored on an article of manufacture such as a computer
diskette or on a storage medium of a computer to allow the
computer to access and perform the method of the preferred
embodiment of the present invention.

It should be recognized that the present invention may
be used in many different sound recognition systems. All such
varied uses are contemplated by the present invention.

CLAIMS

1. A method comprising: 

computing feature vectors for a number of
segments of a sampled tonal sound signal wherein the feature
vectors contain information describing a tonal trend of the
sampled tonal sound signal;

comparing the feature vectors of a first of the
number of segments with the feature vectors of a second of the
number of segments to determine a trend of a movement of a
tone of the sampled tonal sound signal; and

using the trend as an input to determine a word or 
part of a word of the sampled tonal sound signal. 

2. A method according to claim 1 wherein the step of
comparing the feature vectors of a first of the number of 
segments with the feature vectors of a second of the number of 
segments is a cross-correlation method. 

3. A method according to claim 2 wherein the
cross-correlation method uses an equation:

$$R_{xy,i}(\tau) = \sum_{t} FV_i(t)\, FV_{i+1}(t - \tau)$$

where $R_{xy,i}(\tau)$ is a cross-correlation value $R_{xy}$ at one
of the number of segments "i" for a sampled tonal sound signal
at lag "$\tau$", $FV_i(t)$ is one of the feature vectors at an "i-th" one of
the number of segments at time "t", $FV_{i+1}(t - \tau)$ is one of the
feature vectors at an "i+1" one of the number of segments at
time $t - \tau$, and $\sum$ is a summation of $FV_i(t)\, FV_{i+1}(t - \tau)$ for
values of "t".

4. A method according to claim 1 wherein the step of 
computing feature vectors for a number of segments of a 
sampled tonal sound signal comprises computing Cepstral 
vectors for each of the number of segments. 

WO 97/40491    PCT/US97/06075

5. A method according to claim 4 wherein the step of 
computing Cepstral vectors for each of the number of segments 
comprises the steps of: 

calculating a Fourier transform of the sampled tonal
sound signal to obtain a spectrum and squaring the spectrum to
obtain a power spectrum;

calculating a log of the power spectrum; and

calculating an inverse Fourier transform of the
power spectrum to obtain the Cepstral vectors for each of the
number of segments.

6. A method according to claim 5 wherein the step of 
calculating a log of the power spectrum further includes setting 
the log of the power spectrum at a frequency of zero to equal
zero.

7. A device comprising: 

a feature vector computing device for computing
feature vectors for a number of segments of a sampled tonal
sound signal;

a cross-correlator for comparing the feature vectors
of a first of the number of segments with the feature vectors of
a second of the number of segments to determine a trend of a
movement of a tone of the sampled tonal sound signal; and

a word recognizer for using the trend as an input to
determine a word or part of a word of the sampled tonal sound
signal.

8. An article of manufacture having stored thereon data and
instructions which, when loaded into a computer, cause the
execution of the steps of:

computing feature vectors for a number of
segments of a sampled tonal sound signal wherein the feature
vectors contain information describing a tonal trend of the
sampled tonal sound signal;

comparing the feature vectors of a first of the
number of segments with the feature vectors of a second of the
number of segments to determine a trend of a movement of a
tone of the sampled tonal sound signal; and

using the trend as an input to determine a word or
part of a word of the sampled tonal sound signal.